PREFACE

This  memorandum  was  produced  as  part  of  a  program  in 
group  judgment  technology  conducted  for  the  Behavioral 
Sciences  Office  of  the  Advanced  Research  Projects  Agency. 

For many military problems, the best information available
is the judgment of knowledgeable individuals. This is
especially true in the assessment of long-range technological
developments and the evaluation of long-range future
threats. Thus the military has an important stake in
ensuring that the procedures used for obtaining judgments
are adequately designed to elicit the most accurate
estimates possible from the community of experts.

By their very nature, these estimates are uncertain;
therefore, it seems reasonable that they should be couched
in probabilistic terms. This memorandum discusses ways in
which the incentive system imposed on experts may be
structured so as to induce the best possible performance in
probabilistic forecasts, and briefly touches on ways in which
such forecasts may be combined into consensus forecasts.

Related material may be found in RM-5888-PR, RM-5957-PR,
RM-6115-PR, and RM-6118-PR.

SUMMARY

It seems worthwhile, when asking groups of experts for
forecasts of political, economic, or military events, to
occasionally require them to put their forecasts in
probabilistic terms.

The  advantages  of  such  an  approach  are  as  follows: 

(a) It provides a concise expression of subjective
uncertainty.

(b) It provides an operational self-rating as to the
degree of confidence to be placed in the forecast.

(c) It is readily usable in decision-theoretic models.

(d) It is easily combined with other forecasts couched
in similar terms.

To induce accurate forecasts, it seems reasonable to
reward these experts by a scheme related to the extent to which
their forecasts "come true." Such a scheme must be carefully
chosen lest it contain built-in incentives to exaggerate
or to understate probabilities. Systems which are
free of distorting incentives are called "reproducing scoring
systems"; such systems exist for both predictions relating
to discrete alternatives and predictions couched in terms
of continuous distributions. Reproducing scoring systems
have been applied to weather forecasts and classroom tests.
Their application to political, military, and economic
forecasts, however, may give rise to the following problems:

(a) Forecasters may not attempt to maximize their
expected gains.

(b) Conflict of interest can pollute such a system
as it can any other.

(c) A great many forecasts must be considered
before a reproducing scoring system will reliably
distinguish accurate from inaccurate forecasters.

Examination  of  the  advantages  and  disadvantages  of 
various  probabilistic  scoring  systems  is  a  fit  subject  for 
experimental  investigation;  such  experiments  are  now  being 
designed  and  will  be  reported  in  due  course. 




CONTENTS

PREFACE

SUMMARY

Section

1.  WHY PUT FORECASTS IN PROBABILISTIC TERMS?

2.  WHY REPRODUCING SCORING SYSTEMS?

3.  CLASSES OF REPRODUCING SCORING SYSTEMS

4.  APPLICATIONS

5.  POTENTIAL DIFFICULTIES

Appendix

A.  MONOTONICITY

B.  NONUNIQUENESS OF $\varphi$

C.  AN INVARIANCE PROPERTY CHARACTERIZING THE
    QUADRATIC SCORING SYSTEM

D.  BOUNDS ON DISCRIMINATION

BIBLIOGRAPHY




1.  WHY  PUT  FORECASTS  IN  PROBABILISTIC  TERMS? 


Almost every decision we make, in private or in public
life, is implicitly based on forecasts of future events
and conditions. Naturally we are willing, therefore, to
put considerable effort into various types of forecasting
and to consult experts in various fields in order to
improve our knowledge of the likely course of events in the
future. Everyone recognizes that the future is uncertain,
however, and thus it would seem very natural for forecasts
to be generally cast in probabilistic terms. For example,
in forecasting the results of the 1968 presidential
election one might have said "Nixon .60; Humphrey .38; Wallace
.02," or some similar apportionment of probabilities.

This quantitative type of forecast has several obvious
advantages:

(a) It provides a concise expression of subjective
uncertainty.

(b) It provides an operational self-rating as to the
degree of confidence to be placed in the forecast. Someone
who has not studied the electorate carefully will be
more likely to smear his probability assignment over all
possible alternatives than someone who has a deeper
knowledge of the forces at work.

(c) It provides a forecast which is readily usable
by those using the tools of decision theory in choosing
between alternative courses of action.

(d) It provides a forecast which may be easily
combined with other people's forecasts couched in the same
terms. There are many ways in which this may be done.
The most obvious method is simply to average the
probabilities ascribed to each alternative by the set of forecasters.

Edwards and Phillips have found that in the cases they
examined subjects tended to give more accurate probability
estimates if asked for odds rather than probabilities. In
a personal communication, Edwards has suggested using the
mean of the logarithms of the odds as a good way of
combining a collection of probability estimates. A method of
Eisenberg and Gale takes as the consensus the odds
which would be arrived at if the forecasters were allowed
to place bets on the alternatives as at a race track.
These three methods may give quite different results: if
two forecasters ascribe probabilities to two alternatives,
and one ascribes probabilities (.5, .5) while the other
ascribes probabilities (.1, .9), then the various consensus
algorithms give the following results:

                              Alternative 1   Alternative 2
Parimutuel method                  .5              .5
Average probability method         .3              .7
Mean log odds method               .25             .75

The question of which consensus technique is most
appropriate is analogous to the question of which measure of
central tendency (mean, mode, or median) for a body of data
is most appropriate. But it is a question which can be
addressed, experimentally and theoretically, with precision.
Contrast this with the problem of creating a consensus
out of a handful of the essays normally produced as
forecasts by political pundits.
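
The two arithmetic rules in the table above (averaging the probabilities, and averaging the log odds) can be sketched in a few lines. This is only an illustration for the two-alternative case; the function names are ours and do not come from the memorandum, and the parimutuel method is omitted because it requires solving for a betting equilibrium.

```python
import math

def average_consensus(forecasts):
    """Average the probabilities ascribed to each alternative."""
    n_alt = len(forecasts[0])
    return [sum(f[j] for f in forecasts) / len(forecasts) for j in range(n_alt)]

def mean_log_odds_consensus(forecasts):
    """Two alternatives only: average the log odds of alternative 1, then convert back."""
    log_odds = [math.log(f[0] / f[1]) for f in forecasts]
    mean = sum(log_odds) / len(log_odds)
    p1 = 1.0 / (1.0 + math.exp(-mean))
    return [p1, 1.0 - p1]

forecasts = [(0.5, 0.5), (0.1, 0.9)]
print([round(p, 2) for p in average_consensus(forecasts)])        # [0.3, 0.7]
print([round(p, 2) for p in mean_log_odds_consensus(forecasts)])  # [0.25, 0.75]
```

The mean log odds of alternative 1 is the average of log(1) and log(1/9), that is log(1/3), which corresponds to odds of 1 to 3 and hence a probability of .25.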

These advantages seem so overwhelming that it is somewhat
surprising how few forecasts are actually put in terms
of explicit probabilities. There must, therefore, be
factors which militate against putting forecasts in such
terms. Some of these factors may be the following:

(a) Forecasters may be highly knowledgeable in their
field of expertise, but ignorant of the language of
probabilities. The world is full of otherwise intelligent
people who think that chuck-a-luck is a fair game, and
who have very little skill at expressing their expectations
in terms of probabilities. Numerous psychological
experiments have exposed the difficulty with which most people
handle probability concepts. For example, Edwards has
found that most people do not "understand" Bayes' theorem,
in the sense that they will not change prior subjective
probabilities as much as they should in response to new
evidence; that is, they are too conservative. On the other
hand, some unpublished experiments of Dalkey and Cochran
indicate that students overstate their confidence that
given pairs of words are synonyms or antonyms.

(b) Desire for vagueness may afflict some forecasters
who, like clever fortune-tellers, wish to make "prophecies"
which appear to have content but which actually will be
"proved correct" independently of the actual course of
events. Vagueness may also be a bureaucratic necessity
at times. For example, a National Intelligence Estimate
ordinarily represents a consensus between State, the CIA, DIA,
AEC, and FBI. Ambiguous verbal formulas are sometimes
required in order to produce a document on which all these
agencies can agree.

(c) Users may pressure forecasters to come out flatly
behind one alternative or another. To some extent this may
be a maneuver designed to counter the forecaster's desire
for vagueness; on the other hand, it may be related to a
failure to realize that an honest probabilistic forecast
gives considerably more information about a forecaster's
views than simply naming the single alternative which he
considers most likely. Furthermore, good planning should
take into account all reasonable contingencies rather than
just the single most likely one.

(d) Epistemological difficulties arise when you speak
of the "probability" of an inherently unique event. The
frequentist notion of probability propounded by logical
positivists such as Von Mises can give meaning to concepts
such as "the probability that Nixon will win the 1968
presidential election" only by means of a very tortured
construction. The Von Mises notion of probability is, in fact,
not as mathematically precise as it might at first appear,
but a version of it can be made precise with the aid
of recursive function theory (see, for example, DiPaola).
It would be false sophistication, however, for one to be
inhibited from using the language of probability in the way
in which it is used daily by actuaries and gamblers simply
because a particular approach to the foundations of
probability has difficulty in accounting for such usage.

(e)  Desire  for  credit,  in  a  sense  the  opposite  to  the 
desire  for  vagueness,  may  influence  forecasters  to  make 
specific  rather  than  probabilistic  predictions.  If  you 
predict  Nixon  will  win  the  election,  you  will  be  proved 
right  or  wrong  by  the  event.  If  you  say  Nixon  has  a  .6 
chance  of  winning,  what  will  the  actual  event  prove  about 
your  forecast? 

The first three factors above are, of course, problems
with forecasters and users rather than problems with
forecasts. These difficulties can be removed by merely seeing
to it that human beings become more intelligent and more
virtuous. The fourth difficulty, the epistemological
question, is a deep, complex matter about which reasonable
men may differ. It is a problem of rather abstract
philosophy and to deal with it adequately would require a lengthy
tract. The fifth problem is the subject of this paper:
how should you apportion credit to probability forecasters
after the fact? We shall champion a class of scoring
systems called "reproducing scoring systems."


2.  WHY  REPRODUCING  SCORING  SYSTEMS? 


2.1  What  Are  Reproducing  Scoring  Systems? 

Let us suppose that an expert reports to you
probabilities $r_1, r_2, \ldots, r_n$ that each of n mutually exclusive
alternatives is going to take place (and at least one of
them must take place, so that if the expert is internally
consistent $\sum_i r_i = 1$). You do not pay him at once for this
forecast, but wait until one of the alternatives (say,
alternative i) actually takes place, and then pay him an
amount

    $f_i(r_1, r_2, \ldots, r_n)$.

The functions $f_i$ ought to be chosen in such a way
that, if the expert truly believes that $p_1, p_2, \ldots, p_n$
are the probabilities of the events in question, then his
perception of the expected payoff from his prediction,

    $\sum_{i=1}^{n} p_i f_i(r_1, r_2, \ldots, r_n)$,

will be a maximum for $r_1 = p_1$, $r_2 = p_2$, etc. If a reward
system has this property, it is called a "reproducing
scoring system" because it gives the expert an incentive to
report his true beliefs (to "reproduce" them).

We insist, therefore, that the maximum at $r_i = p_i$ be
a strict maximum (e.g., we do not call $f = 0$ a "reproducing
scoring system").
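
The defining property can be checked numerically for any candidate rule. The sketch below is only illustrative: it borrows the logarithmic rule $f_i = \log(n r_i)$ from Sec. 3.3 as a stand-in, assumes three alternatives, and searches a coarse grid of reports; the function names and the grid resolution are our own choices.

```python
import math

def expected_payoff(p, r, score):
    """Expert's perceived expected payoff: sum_i p_i * f_i(r)."""
    return sum(p_i * score(i, r) for i, p_i in enumerate(p) if p_i > 0)

def log_score(i, r):
    """Logarithmic rule f_i(r) = log(n * r_i), used here only as a stand-in."""
    return math.log(len(r) * r[i])

def best_report_on_grid(p, score, steps=20):
    """Search three-alternative reports whose entries are positive multiples of 1/steps."""
    best, best_value = None, -math.inf
    for a in range(1, steps):
        for b in range(1, steps - a):
            r = (a / steps, b / steps, (steps - a - b) / steps)
            value = expected_payoff(p, r, score)
            if value > best_value:
                best, best_value = r, value
    return best

p = (0.6, 0.3, 0.1)                        # the expert's true beliefs
print(best_report_on_grid(p, log_score))   # the grid point with the highest expected payoff
```

For a reproducing rule the report returned by the search should coincide with the beliefs p, provided p itself lies on the grid.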

2.2  Why  Use  Reproducing  Scoring  Systems? 

There  are  two  basic  reasons  to  use  reproducing  scoring 
systems : 

(a) So that the forecaster will provide you with a
true report of his judgments. Even if there is no overt,
explicit score being kept, the forecaster will perceive
that somehow or other his forecast will affect his personal
future in some way (otherwise he will be unwilling to put
any effort into arriving at a reasonable forecast); the
link which he perceives between his forecast and his
personal fortunes is bound to have a strong influence both on
the amount of effort he puts into his forecast and on
the content of the forecast. The use of a reproducing
scoring system is an attempt to explicitly structure this
link in such a way that it will influence the forecaster
to make a sincere effort to determine what is likely to
happen, and then accurately report his view to the user.

(b) So that, over the long term, you can sort out
better forecasters from poorer ones. It appears that today
forecasters are judged more on the quality of their writing
and on their political sense (what do people want to hear?)
than on the quality of their forecasts. A systematic
scoring system would provide an arena in which good
forecasters could excel on the basis of how accurately they
perceived the future rather than on the basis of "styling"
and "catching the word market." As we shall see below,
however, there is a subtle conflict between this use
of reproducing scoring systems and the use given above.

2.3  Some  Horrible  Examples 

The reader at this point may feel that the
arguments in favor of eliciting probabilistic forecasts and
scoring them in some explicit way are valid, but that
any "reasonable" scoring system "will do," and that it
is not necessary to use a reproducing scoring system.
In this section we will present three examples of
"reasonable" scoring systems which are not reproducing
scoring systems, and point out the distortions in forecasts
which they will encourage.

(a) The simplest possible scoring system is simply to
take

    $f_i(r_1, r_2, \ldots, r_n) = r_i$.

This scoring system means that the forecaster will maximize
his expected score by picking the outcome he considers
most likely, and claiming that he is certain it will take
place. That is,

    $r_i = 1$  if $p_i = \max_j p_j$,
    $r_i = 0$  otherwise.

Over a period of time, then, this scoring system will tend
to benefit the incautious forecaster and penalize the
forecaster who reports his true feelings.
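
A small numerical illustration of example (a), using the 1968 election probabilities from Sec. 1 as the forecaster's assumed true beliefs. The function name is ours; the point is only that the expected score under $f_i = r_i$ is higher for the "certain" report than for the honest one.

```python
def expected_linear_score(p, r):
    """Expected payoff under f_i(r) = r_i, namely sum_i p_i * r_i."""
    return sum(p_i * r_i for p_i, r_i in zip(p, r))

p = (0.60, 0.38, 0.02)                                # assumed true beliefs
honest = expected_linear_score(p, p)                  # report the beliefs as they are
extreme = expected_linear_score(p, (1.0, 0.0, 0.0))   # claim certainty for the favorite

print(round(honest, 4))   # 0.5048
print(round(extreme, 4))  # 0.6
```

Because the expected score is linear in the report, it is maximized at a corner of the set of admissible reports, which is why the incautious report wins.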

(b) In some of Norm Dalkey's experiments with the
Delphi method, the respondents produced subjective
probability distributions for a certain quantity whose true
value was known to the experimenters. The question arose
how to rate these probability distributions for accuracy,
and one suggestion was the following: given that the true
value was x and the probability density function submitted
by a respondent was r(y)dy, pick a suitable function P and
let the respondent's "score" be

    $f(r(\cdot)) = 1 - \int_D P(|x-y|)\, r(y)\, dy$

where D is the domain over which r(y) is defined. Let us
suppose that p(x)dx is the probability density function
which the respondent "really" has in mind. Then he will
perceive his expected score as being

    $E(f(r(\cdot))) = 1 - \int_D r(y) \left\{ \int_D P(|x-y|)\, p(x)\, dx \right\} dy$.

Note that the quantity inside the curly brackets is
a function of y alone. Thus the respondent may maximize
his expected score by concentrating all the mass of r(y)dy
at the value of y which minimizes the quantity in curly
brackets. For example, if $P(|x-y|) = |x-y|$, the respondent
should concentrate r(y)dy at the median of his subjective
distribution; if $P(|x-y|) = |x-y|^2$ he should concentrate
its mass at the mean. So this is another scoring scheme
which tends to favor the incautious forecaster. In fact,
example (a) above may be viewed as a special case of this
example, with

    $P(|i-j|) = 0$  if $i = j$,
    $P(|i-j|) = 1$  if $i \ne j$.

This scoring procedure was not, in fact, used in the
Dalkey experiments because its undesirable features were
recognized in time. But if it had been used it is easy to
imagine some of the erroneous conclusions it would have
engendered concerning the relationship between the spread of
a subjective probability distribution and accuracy (they are
related, of course, but not as strongly as this scoring
system would have suggested).
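
The argument of example (b) can be illustrated with a discrete analogue, in which the integrals become sums. Everything in the sketch below (the five-point belief distribution, the search grid for y, and the function names) is an assumption chosen for illustration, not data from the Dalkey experiments.

```python
def bracket(y, xs, ps, penalty):
    """Discrete analogue of the bracketed integral: sum_x P(|x - y|) p(x)."""
    return sum(penalty(abs(x - y)) * p for x, p in zip(xs, ps))

def best_point(xs, ps, penalty, grid):
    """Value of y on the grid that minimizes the bracketed quantity."""
    return min(grid, key=lambda y: bracket(y, xs, ps, penalty))

def expected_score(report, xs, ps, penalty):
    """1 - sum_y r(y) * bracket(y), for a report given as {y: mass}."""
    return 1.0 - sum(mass * bracket(y, xs, ps, penalty) for y, mass in report.items())

xs = [0, 1, 2, 3, 10]                     # possible true values (hypothetical)
ps = [0.30, 0.25, 0.20, 0.15, 0.10]       # respondent's real beliefs (hypothetical)
grid = [k / 100 for k in range(0, 1001)]  # candidate points y between 0 and 10

median_like = best_point(xs, ps, lambda d: d, grid)      # absolute penalty
mean_like = best_point(xs, ps, lambda d: d * d, grid)    # squared penalty
print(median_like, mean_like)             # 1.0 2.1 (the median and the mean)

honest = dict(zip(xs, ps))
print(expected_score({median_like: 1.0}, xs, ps, lambda d: d)
      > expected_score(honest, xs, ps, lambda d: d))     # True
```

The last line confirms that, under the absolute penalty, a point mass at the median is perceived as scoring better than an honest report of the whole distribution.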

(c) A scoring system which we shall call the "Colonel
Blotto scoring system" is as follows: suppose you have two
or more forecasters making predictions about which of n
mutually exclusive alternatives will take place. Award
one point to the forecaster who ascribes the highest
probability to the event which actually takes place, and
nothing to the others. After a reasonable number of
forecasts have been scored in this way, your best forecaster
will presumably have accumulated the greatest number of
points.

To see what is wrong with this scoring system let us
consider the following simple case: you have exactly two
forecasters and three equally likely alternatives. Suppose
forecaster #1 (correctly) ascribes probability 1/3 to each
alternative, while forecaster #2 (through either
foolishness or guile) ascribes probability 1/2 to each of the
first two alternatives and probability 0 to the third; then
the second (inaccurate) forecaster is twice as likely to
win the point as is the first (accurate) forecaster. If
the first forecaster guesses what kind of ascription the
second forecaster is making, he can switch the odds back
in his favor by ascribing (.6, .2, .2) to the three
possible outcomes, and so on. The two players become locked
in the equivalent of a two-person, zero-sum, continuous
Colonel Blotto game with one another, and it becomes
impossible to infer from their ascriptions what they actually
believe are the relative probabilities of the alternatives.
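
The odds quoted in this example can be verified by direct enumeration. In the sketch below the function name is ours, and the rule that a tie awards no point is an assumption (ties do not arise in the cases shown).

```python
from fractions import Fraction as F

def win_probabilities(truth, report_1, report_2):
    """Chance that each forecaster wins the point; a tie awards no point (assumption)."""
    wins = [F(0), F(0)]
    for p, a, b in zip(truth, report_1, report_2):
        if a > b:
            wins[0] += p
        elif b > a:
            wins[1] += p
    return wins

truth = [F(1, 3)] * 3                    # three equally likely alternatives
honest = [F(1, 3)] * 3                   # forecaster #1 reports the truth
schemer = [F(1, 2), F(1, 2), F(0)]       # forecaster #2
counter = [F(3, 5), F(1, 5), F(1, 5)]    # forecaster #1's reply (.6, .2, .2)

print(win_probabilities(truth, honest, schemer))   # [Fraction(1, 3), Fraction(2, 3)]
print(win_probabilities(truth, counter, schemer))  # [Fraction(2, 3), Fraction(1, 3)]
```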

This game, by the way, was introduced by Emile Borel in
one of the earliest monographs on game theory [2]. He
considered it a very difficult game, but a solution was found
thirty years later by Oliver Gross [11, 12]. An optimal
strategy is the following: label the sides of an equilateral
triangle of unit area with the three alternatives under
consideration; inscribe a circle in the triangle and erect
a hemisphere upon it; choose a point at random on the
hemisphere and drop a perpendicular from it to the plane of the
triangle; ascribe to each alternative a probability equal
to the area of the triangle determined by the foot of the
perpendicular and the side corresponding to the alternative
in question. Performing this ritual will protect you from
being outguessed by your opponent. But the probability
you ascribe to a given alternative has little to do with
your image of the true probability: in the case we have
been discussing it is equally likely to be any number
between 0 and 2/3.
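
The geometric ritual just described is easy to simulate. The sketch below is our own rendering of it, under the assumption that "at random" means uniformly over the surface of the hemisphere; the inscribed circle is scaled to radius 1, so that the height of the triangle is 3 and each area is the distance from the foot of the perpendicular to the corresponding side, divided by 3.

```python
import math
import random

# Unit inward normals of the three sides of an equilateral triangle (120 degrees apart).
NORMALS = [(math.cos(a), math.sin(a))
           for a in (math.pi / 2, 7 * math.pi / 6, 11 * math.pi / 6)]

def sample_ascription(rng):
    """One draw of the randomized ascription for three alternatives."""
    # A uniform point on the unit sphere; the sign of z does not affect the
    # foot of the perpendicular, so the hemisphere and the sphere agree here.
    x, y, z = (rng.gauss(0.0, 1.0) for _ in range(3))
    norm = math.sqrt(x * x + y * y + z * z)
    u, v = x / norm, y / norm   # foot of the perpendicular, relative to the center
    # Distance from the foot to side i is 1 + n_i.(u, v); dividing by the
    # height 3 gives the area of the sub-triangle as a share of the unit total.
    return [(1.0 + nx * u + ny * v) / 3.0 for nx, ny in NORMALS]

rng = random.Random(0)
draws = [sample_ascription(rng) for _ in range(20000)]
share_below = sum(d[0] < 1.0 / 6.0 for d in draws) / len(draws)
print(round(share_below, 2))    # close to 0.25 if the ascription is uniform on [0, 2/3]
```

The three ascriptions always sum to one, and each of them taken alone should be spread evenly between 0 and 2/3, so about a quarter of the draws fall below 1/6.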


3.  CLASSES  OF  REPRODUCING  SCORING  SYSTEMS


3.1  Preliminaries 

In principle, reproducing scoring systems are not
necessarily symmetric. That is, if you assign the same
probabilities to alternative #1 and alternative #2, your
payoff if #1 actually occurs is not necessarily the same
as it would be if #2 actually occurs. One could conceive
of applications in which this asymmetry would be useful,
but in general it seems more reasonable to restrict our
attention to symmetric scoring schemes, in which the
payoff depends only on the probability assigned to the
alternative which takes place, rather than on the label attached
to that alternative. This restriction will also make our
mathematical symbolism considerably simpler. The
modifications which would be required to extend our results to
asymmetric scoring systems will, in most instances, be
rather obvious.

Adding a constant to a reproducing scoring system
yields another reproducing scoring system. It is
convenient to normalize a scoring system (to be applied to n
alternatives) in such a way that the payoff for assigning
a probability of 1/n to the alternative which actually takes
place is zero. This makes a certain amount of heuristic
sense: if confronted with n alternatives about which you
know absolutely nothing, the "principle of equal ignorance"
suggests that you should assign equal probabilities to them,
in which case your payoff is zero.


In what follows, therefore, we shall assume (unless
the contrary is specifically stated) that any reproducing
scoring system we mention is symmetric, and normalized so
that the payoff for assigning equal probabilities to all
alternatives is zero.


3.2  Two  Alternatives 

Let us suppose we have two distinct alternatives, one
of which must occur and both of which cannot occur. The
expert whom we consult reports to us probability $r_1$ for the
first alternative and $r_2$ for the second. Logic compels him
to assign these probabilities in such a way that $r_1 + r_2 = 1$.
We agree to pay him $f(r_1)$ if the first alternative comes
true, and $f(r_2) = f(1 - r_1)$ if the second comes true (a
negative payment, as usual, corresponds to his paying us).
Let $p_1$ and $p_2$ represent the probabilities which, in his
heart, our expert actually ascribes to the two alternatives;
$p_1$ and $p_2$ may not be the same as the probabilities which
he reports to us. He will perceive his expected gain from
the exercise to be

    $G = p_1 f(r_1) + (1 - p_1) f(1 - r_1)$.


The essence of a reproducing scoring system is that
the expert should perceive his expected gain to be
maximized if he reports to us the probabilities which he
actually ascribes to the events in question. If f is
differentiable,* a necessary condition for this is the following
(in which we suppress subscripts for notational convenience):

(3.2.1)    $r f'(r) - (1-r) f'(1-r) = 0, \qquad 0 < r < 1$.

Now define $\varphi(r)$ as follows:

    $r f'(r) = \varphi(r), \qquad 0 < r < 1$.

Since $\varphi(1-r) - \varphi(r) = (1-r) f'(1-r) - r f'(r) = 0$,
we may also write

    $r f'(r) = \varphi(r) + r[\varphi(1-r) - \varphi(r)]$

and thus, since $f(\tfrac{1}{2}) = 0$,

(3.2.2)    $f(r) = \int_{1/2}^{r} \frac{\varphi(t)}{t}\, dt - \left[ \int_{1/2}^{r} \varphi(t)\, dt + \int_{1/2}^{1-r} \varphi(t)\, dt \right]$.

We have shown that any differentiable reproducing
scoring function f(r) for two alternatives may be put in
the form of Eq. (3.2.2). In the next section we shall show
that if $\varphi(x)$ is positive except on a set of measure zero,
the function defined as in Eq. (3.2.2) is a reproducing
scoring function.

*It is proved in App. A that f is monotone increasing;
thus it must be differentiable almost everywhere.
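
As a numerical sketch of Eq. (3.2.2), read as $f(r) = \int_{1/2}^{r} \varphi(t)/t\,dt - [\int_{1/2}^{r} \varphi(t)\,dt + \int_{1/2}^{1-r} \varphi(t)\,dt]$, the code below builds f from a given $\varphi$ by midpoint integration and then searches for the report that maximizes the expected gain G. The helper names, the integration step count, and the grid of candidate reports are our own assumptions.

```python
import math

def integrate(g, a, b, steps=2000):
    """Midpoint rule for the integral of g from a to b (b may lie below a)."""
    h = (b - a) / steps
    return h * sum(g(a + (k + 0.5) * h) for k in range(steps))

def score(phi, r):
    """Eq. (3.2.2): payoff when the alternative given probability r occurs."""
    return (integrate(lambda t: phi(t) / t, 0.5, r)
            - integrate(phi, 0.5, r)
            - integrate(phi, 0.5, 1.0 - r))

def expected_gain(phi, p, r):
    """G = p f(r) + (1 - p) f(1 - r)."""
    return p * score(phi, r) + (1.0 - p) * score(phi, 1.0 - r)

def best_report(phi, p):
    """Report among 0.05, 0.10, ..., 0.95 with the highest expected gain."""
    candidates = [k / 20 for k in range(1, 20)]
    return max(candidates, key=lambda r: expected_gain(phi, p, r))

print(round(score(lambda x: 2 * x, 0.8), 6))   # 0.42, i.e. 2r - r^2 - (1-r)^2 - 1/2
print(round(score(lambda x: 1.0, 0.8), 6))     # 0.470004, i.e. log(2r) at r = 0.8
print(best_report(lambda x: 2 * x, 0.7))       # 0.7
print(best_report(lambda x: 1.0, 0.7))         # 0.7
```

With either choice of a positive $\varphi$, the report that maximizes the perceived expected gain is the expert's own probability.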

3.3  The  Gambling  House  Construction  Method 

Suppose our expert enters a gambling house in which
there are an infinite number of altruistic gamblers. Each
gambler corresponds to a point on the interval from zero
to one; the gambler corresponding to x is willing to accept
an infinitesimal positive wager $\varphi(x)\,dx$ that alternative #1
will occur, and offers odds of one for x. That is to say,
if our expert loses the wager he loses $\varphi(x)\,dx$, while if he
wins the wager he wins $\varphi(x)\,dx/x - \varphi(x)\,dx$.

Obviously our expert will perceive it to be in his
interest to place bets with those gamblers (and only with
those gamblers) who correspond to $x \le p_1$; for he will feel
that all these wagers are at favorable or fair odds, while
any bets with gamblers corresponding to $x > p_1$ would be
placed at odds which appear unfavorable to our expert. If
he then visits a second gambling house which is exactly
like the first except that the gamblers in the second house
accept wagers on the occurrence of alternative #2, then
it is easy to see that his net gain, if alternative #1
occurs, will be

    $\int_{0}^{p_1} \frac{\varphi(x)}{x}\, dx - \left[ \int_{0}^{p_1} \varphi(x)\, dx + \int_{0}^{p_2} \varphi(x)\, dx \right]$.

This defines a symmetric reproducing scoring system,
but it offers better than zero payoff at $p_1 = p_2 = 1/2$. To
normalize the system we must subtract a constant; this is
equivalent to making the lower limits on all the integrals
1/2 instead of 0. Thus we are left with Eq. (3.2.2). In other
words, Eq. (3.2.2) "characterizes" symmetric reproducing
scoring systems. This theorem was discovered independently
by Shuford, Albert and Massengill [18] and by Aczel and
Pfanzagl.

This "gambling house" construction method has the very
desirable feature that it generalizes directly to more than
two alternatives. The same discussion as we went through
above for two alternatives shows that, if $\varphi(x)$ is any
positive function whatsoever, then the following is a
reproducing scoring system for n alternatives:

(3.3.2)    $f_i(r_1, r_2, \ldots, r_n) = \int_{1/n}^{r_i} \frac{\varphi(x)}{x}\, dx - \sum_{j=1}^{n} \int_{1/n}^{r_j} \varphi(x)\, dx$.
When we are dealing with n alternatives and a
symmetric scoring system, we may write the scoring function in
the form f(r,v), where r represents the probability ascribed
to the alternative which actually occurs and v is some
symmetric function of the $r_i$'s ($r_i$, of course, being the
probability ascribed to the i-th alternative). Let us look
at two specific reproducing scoring systems which are
generated by Eq. (3.3.2).

(a) The quadratic scoring system. Let $\varphi(x) = 2x$, and
we get

    $f(r,v) = 2r - v - \frac{1}{n}, \qquad v = \sum_{i=1}^{n} r_i^2$.

This scoring system was used by Bruno de Finetti in
a series of studies involving football forecasts. It may
be shown that, in the case n = 2, the quadratic scoring
system is the only one in which the difference between the
expected payoff of a "perfect" expert and of a given
expert is a function only of the difference between the
"true" probability and the probability ascribed by the
given expert (see Appendix C).

(b) The logarithmic scoring system. Let $\varphi(x) = 1$,
and we get

    $f(r,v) = \log(n \cdot r)$.

We shall see in the next section that, for n > 2,
this is essentially the only differentiable reproducing
scoring system which has the property that the payoff
depends only on the probability ascribed to the event
which actually occurs (and not on the probabilities ascribed
to events which do not occur). In other words, it is the
only f(r,v) which does not depend explicitly on v. This
fact was evidently first observed by A. H. Gleason (see [15]),
but the only published proofs I have found are by Aczel and
Pfanzagl [1] and Shuford, Albert, and Massengill [18].

This scoring system relates in an interesting way to
information theory. An expert who reports a spectrum of
probabilities $(r_1, r_2, \ldots, r_n)$ will assess his expected
payoff as

    $\sum_{i=1}^{n} r_i \log(n \cdot r_i)$

which is essentially a constant (log n) minus the entropy
of his partition. In other words, his expected payoff is
exactly the same as the amount by which he is able to
reduce the expected information in the event itself.
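
The closed forms of this section can be compared with Eq. (3.3.2), read as $f_i = \int_{1/n}^{r_i} \varphi(x)/x\,dx - \sum_j \int_{1/n}^{r_j} \varphi(x)\,dx$, by numerical integration. The sketch below does so for three alternatives and also checks the entropy identity for the logarithmic system. The helper names, the example report, and the use of natural logarithms are our own assumptions.

```python
import math

def integrate(g, a, b, steps=2000):
    """Midpoint rule for the integral of g from a to b (b may lie below a)."""
    h = (b - a) / steps
    return h * sum(g(a + (k + 0.5) * h) for k in range(steps))

def gambling_house_score(phi, r, i):
    """Eq. (3.3.2): payoff when alternative i occurs, given the report r."""
    n = len(r)
    own = integrate(lambda x: phi(x) / x, 1.0 / n, r[i])
    others = sum(integrate(phi, 1.0 / n, r_j) for r_j in r)
    return own - others

def quadratic_score(r, i):
    """f(r, v) = 2r - v - 1/n with v the sum of squared probabilities."""
    return 2.0 * r[i] - sum(r_j ** 2 for r_j in r) - 1.0 / len(r)

def log_score(r, i):
    """f(r, v) = log(n * r)."""
    return math.log(len(r) * r[i])

def entropy(r):
    return -sum(r_j * math.log(r_j) for r_j in r if r_j > 0)

r = (0.5, 0.3, 0.2)   # an illustrative report
for i in range(len(r)):
    print(round(gambling_house_score(lambda x: 2 * x, r, i) - quadratic_score(r, i), 6),
          round(gambling_house_score(lambda x: 1.0, r, i) - log_score(r, i), 6))

expected_log_payoff = sum(r_i * log_score(r, i) for i, r_i in enumerate(r))
print(abs(expected_log_payoff - (math.log(len(r)) - entropy(r))) < 1e-12)   # True
```

Each printed difference should be zero to within the accuracy of the integration, and the last line confirms that the honest expert's expected logarithmic payoff equals log n minus the entropy of his report.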

3.4  Many  Alternatives 

We demonstrated in Secs. 3.2 and 3.3 above that all
differentiable two-alternative reproducing scoring schemes
are generated by the "gambling house" method (Eq. (3.3.2)).
The following example shows that this may not be
true for n-alternative schemes:

    $f(r,v) = \frac{r}{\sqrt{v}}, \qquad v = \sum_{i=1}^{n} r_i^2$.

This is the so-called "spherical" scoring scheme, which
was invented by Masanao Toda [22]. If n = 2 it is generated
by

    $\varphi(x) = \frac{x}{2\,[x^2 + (1-x)^2]^{3/2}}$.

For n > 2, however, I have been unable to find a $\varphi(x)$ which
will, when plugged into Eq. (3.3.2), give the required function.
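
That the spherical scheme is reproducing for more than two alternatives can at least be checked numerically. The sketch below takes the scheme to be $r/\sqrt{v}$ with $v = \sum_i r_i^2$, shifted by the constant $1/\sqrt{n}$ so that equal reports pay zero (the shift is our reading of the normalization convention of Sec. 3.1), and searches a grid of three-alternative reports. The function names and grid are assumptions.

```python
import math

def spherical_score(r, i):
    """r_i / sqrt(sum_j r_j^2), shifted so that equal reports pay zero."""
    n = len(r)
    return r[i] / math.sqrt(sum(r_j ** 2 for r_j in r)) - 1.0 / math.sqrt(n)

def expected_score(p, r):
    """Perceived expected payoff of report r for an expert with beliefs p."""
    return sum(p_i * spherical_score(r, i) for i, p_i in enumerate(p))

def best_report_on_grid(p, steps=20):
    """Search all three-alternative reports whose entries are multiples of 1/steps."""
    best, best_value = None, -math.inf
    for a in range(steps + 1):
        for b in range(steps + 1 - a):
            r = (a / steps, b / steps, (steps - a - b) / steps)
            value = expected_score(p, r)
            if value > best_value:
                best, best_value = r, value
    return best

p = (0.5, 0.3, 0.2)
print(best_report_on_grid(p))   # (0.5, 0.3, 0.2)
```

The expected payoff is, up to a constant, the length of p times the cosine of the angle between p and r, so it is largest when the report points in the same direction as the beliefs; hence the name "spherical."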

Although we are not able to give a characterization
of differentiable n-alternative reproducing scoring systems
such as we gave for 2-alternative systems, we are able to
derive a useful necessary condition that such systems must
fulfill, analogous to Eq. (3.2.1). By taking appropriate
directional derivatives we are able to determine the following
set of n equations which f(r,v) must satisfy:


(3.4.1)  r_i \left. \frac{\partial f}{\partial r} \right|_{r=r_i} + \Big\{ \sum_{j} r_j \left. \frac{\partial f}{\partial v} \right|_{r=r_j} \Big\} \frac{\partial v}{\partial r_i} = w,  \qquad  i = 1, 2, \ldots, n

where w is a symmetric function of the r's (that is,
symmetric on the surface \sum_{i=1}^{n} r_i = 1). From this
equation it is possible to deduce Gleason's result that
the logarithmic scoring scheme is essentially the only one
in which the payoff depends only on the probability
assigned to the alternative which actually occurs. For if
f(r,v) is a function of r alone then


     \frac{\partial f}{\partial v} = 0

so we have

     r_i \left. \frac{\partial f}{\partial r} \right|_{r=r_i} = w.

But the left-hand side is a function of r_i alone, while
the right-hand side is a symmetric function. The only
symmetric function (if n > 2) which does not depend on the
other r's is the constant function. Thus w = c and

     \frac{df}{dr} = \frac{c}{r}

from which it follows that f is a logarithmic function
of r.

The logarithmic reward function has the peculiarity
that, if the expert in question assigns zero probability
to an event which actually takes place, the penalty levied
against him is infinite. This may or may not seem
inappropriate.* In case it seems inappropriate, Shuford
et al. (Ref. 18) recommend simply truncating the reward function
at some small ε > 0 and giving the expert the same reward
for r_i < ε as he gets for r_i = ε. This reward structure
is not a reproducing scoring system for small values of r,
of course. Another approach which is more consistent with
the spirit of reproducing scoring systems is to use, instead
of the log generator φ(x) = 1, the following φ:

(3.4.2)  \varphi(t) = tKn    if  t < \frac{1}{Kn}
                    = 1      elsewhere

K stands for some large constant. This φ leads to the
following pay-off function:

     f(r,v) = \log nr - v                if  r \ge \frac{1}{Kn}
            = Knr - \log K - 1 - v       if  r < \frac{1}{Kn}

where

     v = \sum_{\text{all } i \text{ s.t. } r_i < 1/(Kn)} \frac{[1 - r_i Kn]^2}{2Kn}



* Thomas L. Hughes, former head of intelligence for
the U.S. State Department, tells the story of a certain
British intelligence officer who, upon his retirement in
1950 after forty-seven years of service, reminisced:
"Year after year the worriers and fretters would come to
me with awful predictions of the outbreak of war. I denied
it each time. I was only wrong twice."

It is easy to see that if there are no estimates less than
1/(Kn), this payoff function is identical with the logarithmic
one.
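A minimal sketch of this modified payoff follows, assuming the piecewise form f(r,v) = log nr - v for r >= 1/(Kn) and Knr - log K - 1 - v otherwise, with v summed over the estimates below 1/(Kn). The function name, the argument layout, and the default K = 100 are illustrative assumptions, not part of the scheme itself.

```python
import math

def modified_log_payoff(r, k, K=100.0):
    """Payoff when alternative k occurs and the expert reported the list r.

    r is assumed to sum to one; K is the 'large constant' of the text.
    """
    n = len(r)
    cut = 1.0 / (K * n)
    # Correction term v: only estimates below the cutoff contribute.
    v = sum((1.0 - ri * K * n) ** 2 / (2.0 * K * n) for ri in r if ri < cut)
    rk = r[k]
    if rk >= cut:
        return math.log(n * rk) - v
    return K * n * rk - math.log(K) - 1.0 - v

print(modified_log_payoff([0.2, 0.3, 0.5], 1))   # same as log(3 * 0.3)
print(modified_log_payoff([0.0, 0.5, 0.5], 0))   # finite, unlike the pure log rule
```

The first call shows the agreement with the logarithmic rule when no estimate is below the cutoff; the second shows that a zero estimate for the event that occurs now costs a large but finite amount.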

3.5  Continuous  Cases 

Let us suppose you are trying to elicit from your experts
an estimate of what the population of Uganda will
be in 1980. One way to proceed would be to ask the experts
to estimate the probability that the population would
be less than 7 million, between 7 million and 8 million,
between 8 million and 9 million, and so on. Unfortunately
the way in which you structure your alternatives may have
an undue influence on the experts' answers, and if you wish
to combine an opportunity for precision with a wide range
of possible answers the number of alternatives may become
impossibly large. A different approach which seems preferable
is to ask the experts directly to specify a probability
distribution over the possible future population of Uganda
in some way (possibly by asking for percentile breaks, for
example) and then score this continuous distribution directly.
It appears that any n-alternative reproducing scoring system
may be converted, by a limiting process as n becomes large,
into a scoring system for continuous distributions on the
real line. The details of the limiting process will differ
from case to case; we will discuss three examples.

(a) Quadratic Scoring System. Let us suppose that
the domain of possible answers is of length D, and is
partitioned into intervals of length Δx. If
r(x) is an estimating probability density function on D,
then the quadratic reward function with respect to this
partition would be (assuming r(x) is approximately constant
over each interval Δx)

(3.5.1)  f(x_i) = 2 r(x_i) \Delta x - \sum_j [r(x_j) \Delta x]^2 - \frac{\Delta x}{D}

The last term in this expression, the only one which
depends explicitly on D, is only a constant intended to
normalize the reward function so that assigning equal
probability to all alternatives gets reward zero. Since
in this case it is convenient to be noncommittal about
what is meant by "all alternatives" we will delete this
constant (this deletion does not, of course, destroy the
reproducing property of the quadratic scoring system).

For the reward function specified by Eq. (3.5.1) to remain
nonzero as Δx → 0, it is necessary to divide out Δx. The
resulting sequence of reward functions then approaches,
in the limit,

(3.5.2)  f(x) = 2 r(x) - \int_D [r(t)]^2 \, dt



Let us illustrate the application of this scoring
system to two estimates of the population of Uganda in
1980. Suppose two experts are asked to estimate the population
in millions. Expert number one gives the following
distribution:

     r_1(x) = 1/5     7 < x < 12
            = 0       elsewhere

Suppose expert number two gives the following distribution:

     r_2(x) = 3/5     9 < x < 10
            = 1/5     8 < x < 9,  10 < x < 11
            = 0       elsewhere

Then we can calculate their payoffs under various
contingencies in the following table:

TRUE VALUE       PAYOFF TO EXPERT #1     PAYOFF TO EXPERT #2

x < 7                  - .2                    - .44
7 < x < 8              + .2                    - .44
8 < x < 9              + .2                    - .04
9 < x < 10             + .2                    + .76
10 < x < 11            + .2                    - .04
11 < x < 12            + .2                    - .44
12 < x                 - .2                    - .44

Note that although both experts assigned the same
probability to x falling between 8 and 9, expert #1 gets
a positive reward if that contingency comes to pass while
expert #2 gets a negative reward. The justification for
this is, of course, that expert #2 thought that x was much
more likely to fall between 9 and 10, and he "dropped a
bundle" betting on that contingency.
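The quadratic payoffs for the two experts can be recomputed with a few lines of code. This is a sketch under the assumption that each density is piecewise constant and is stored as a list of (low, high, height) triples; the representation and the function names are illustrative only.

```python
def density_at(pieces, x):
    """Height of a piecewise-constant density at x (zero outside all pieces)."""
    for lo, hi, height in pieces:
        if lo < x < hi:
            return height
    return 0.0

def integral_of_square(pieces):
    """Integral of r(t)^2 over the whole domain."""
    return sum((hi - lo) * height * height for lo, hi, height in pieces)

def quadratic_score(pieces, x):
    """Continuous quadratic payoff: 2 r(x) minus the integral of r(t)^2."""
    return 2.0 * density_at(pieces, x) - integral_of_square(pieces)

expert1 = [(7, 12, 0.2)]
expert2 = [(8, 9, 0.2), (9, 10, 0.6), (10, 11, 0.2)]

for x in (6.5, 7.5, 8.5, 9.5, 10.5, 11.5, 12.5):
    print(x, round(quadratic_score(expert1, x), 2), round(quadratic_score(expert2, x), 2))
```

Each sample point is the midpoint of one row of the table, so the printed values can be compared with the table row by row.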

(b) Spherical Scoring System. Under the assumptions
made above, the spherical scheme would give us

(3.5.3)  f(x_i) = \frac{r(x_i) \Delta x}{\sqrt{\sum_j r(x_j)^2 \Delta x^2}} - \frac{1}{\sqrt{D / \Delta x}}

We ignore the constant term. The Δx's cancel one another,
but since the denominator becomes large as n → ∞
we must divide the reward by √Δx in order to keep it from
approaching zero as Δx → 0. We then have in the limit the
following scheme:

(3.5.4)  f(x) = \frac{r(x)}{\sqrt{\int_D r(t)^2 \, dt}}

Applying Eq. (3.5.4) to the distributions given in the
previous example generates the following table:

TRUE VALUE       PAYOFF TO EXPERT #1     PAYOFF TO EXPERT #2

x < 7                   0                       0
7 < x < 8               .447                    0
8 < x < 9               .447                    .302
9 < x < 10              .447                    .904
10 < x < 11             .447                    .302
11 < x < 12             .447                    0
12 < x                  0                       0

(c) Logarithmic Scoring System. If the same type of
limit operation carried out in the example above is applied
to the logarithmic scoring system we come out with merely
the following:

(3.5.5)  f(x) = \log r(x)

Applying Eq. (3.5.5) to the two distributions discussed in
the preceding example leads to the following table.

TRUE VALUE       PAYOFF TO EXPERT #1     PAYOFF TO EXPERT #2

x < 7                  - ∞                     - ∞
7 < x < 8              - 1.609                 - ∞
8 < x < 9              - 1.609                 - 1.609
9 < x < 10             - 1.609                 - .511
10 < x < 11            - 1.609                 - 1.609
11 < x < 12            - 1.609                 - ∞
12 < x                 - ∞                     - ∞


Note that this is the only case in which we get away
from the difficulty (if you consider it such) of giving
different payoffs to experts who assess the same probability
for the outcome actually occurring.
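The spherical and logarithmic payoffs for the same two experts can be recomputed in the same spirit. As before, this is only a sketch: the piecewise-constant representation as (low, high, height) triples and the function names are assumptions made for illustration, and natural logarithms are used.

```python
import math

def density_at(pieces, x):
    """Height of a piecewise-constant density at x (zero outside all pieces)."""
    for lo, hi, height in pieces:
        if lo < x < hi:
            return height
    return 0.0

def integral_of_square(pieces):
    """Integral of r(t)^2 over the whole domain."""
    return sum((hi - lo) * height * height for lo, hi, height in pieces)

def spherical_score(pieces, x):
    """Continuous spherical payoff: r(x) over the root of the integral of r(t)^2."""
    return density_at(pieces, x) / math.sqrt(integral_of_square(pieces))

def log_score(pieces, x):
    """Continuous logarithmic payoff: log r(x), minus infinity where r(x) = 0."""
    r = density_at(pieces, x)
    return math.log(r) if r > 0 else float("-inf")

expert1 = [(7, 12, 0.2)]
expert2 = [(8, 9, 0.2), (9, 10, 0.6), (10, 11, 0.2)]

for x in (6.5, 7.5, 8.5, 9.5, 10.5, 11.5, 12.5):
    print(x, spherical_score(expert1, x), spherical_score(expert2, x),
          log_score(expert1, x), log_score(expert2, x))
```

At x between 8 and 9 both experts report the same density, and only the logarithmic rule gives them the same payoff.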


4.  APPLICATIONS

4.1  Introduction

Reproducing scoring systems have been studied, mathematically
and in experimental exercises, for about fifteen
years  now.  Only  recently,  however,  have  they  been  used  in 
practical  applications.  In  this  section  we  will  review 
some  of  these  applications  and  suggest  how  reproducing 
scoring  systems  could  be  applied  to  improve  political  and 
economic  forecasting  in  Delphi-type  procedures. 

4.2  Testing  Students 

All of us have occasionally taken true-false tests
in which the instructor claimed to "count points off for
guessing." By this he usually meant that he awarded +1 for
a right answer and -1 for a wrong answer. Any student
soon realizes that this is not really penalizing guessing
at all, but rather that he will maximize his expected
score by putting down some answer to every question, even
if he thinks that the chance his answer is right is only
slightly greater than the chance that it is wrong. A student
putting down "true" or "false" to a question has no
way of indicating his degree of uncertainty as to the
correctness of his answer. This in turn means that the
teacher who uses objective tests can only get an accurate
reading on the state of the class' knowledge by administering
rather long tests.

It would seem reasonable to let the students mark a
question "true with probability .80" instead of requiring
them to say "true with probability 1.00" or "true with
probability 0." Indeed, the latter approach may be
subtly training students to tend toward extreme opinions.

The  use  of  reproducing  scoring  systems  offers  a  way  of 
inducing  them  to  put  down  their  true  subjective  feelings 
about  each  question,  and  this  should  vastly  increase  the 
amount  of  information  which  the  teacher  gets  about  a  class 
from  a  given  set  of  questions. 
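The contrast between the two kinds of test scoring can be sketched in a few lines. The quadratic payoff used here, normalized so that a report of .50 scores zero, is an assumption chosen for illustration, as are the function names and the grid of admissible reports.

```python
def expected_plus_minus_one(p):
    """Expected score for answering 'true' under +1/-1 scoring when the
    student believes the statement is true with probability p."""
    return p * 1.0 + (1.0 - p) * (-1.0)

def quadratic(r):
    """Two-alternative quadratic payoff for probability r placed on the
    answer that turns out to be right (zero for r = 1/2)."""
    return 2.0 * r - r * r - (1.0 - r) ** 2 - 0.5

def expected_quadratic(report, p):
    """Expected quadratic payoff for reporting 'true with probability report'."""
    return p * quadratic(report) + (1.0 - p) * quadratic(1.0 - report)

def best_report(p, grid=101):
    """Report on a grid of 0, .01, ..., 1 that maximizes the expected payoff."""
    reports = [k / (grid - 1) for k in range(grid)]
    return max(reports, key=lambda r: expected_quadratic(r, p))

print(expected_plus_minus_one(0.55))   # positive, so even a weak hunch gets written down
print(best_report(0.80))               # the honest report
```

Under +1/-1 scoring any belief above .50 makes answering worthwhile, so the answer sheet cannot distinguish a hunch from a certainty; under the quadratic rule the best report is the student's actual belief.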

This application has been vigorously championed by
Shuford and Massengill (Ref. 20) and they have marketed materials
which enable teachers to understand and apply reproducing
scoring systems in the classroom without getting involved
in difficult computations. These techniques have been
applied in the Academic Instructors Course at Air University,
and in pilot programs at the Air Force's Chanute Technical
Training Center, the U.S. Army Signal Center and School
at Fort Monmouth, the Naval Service School Commands at
Great Lakes and at San Diego, and the Naval Air Basic
Training Command at Pensacola. Besides offering an improved
technique for written objective tests, reproducing
scoring systems offer a good method for predicting whether
a person can or cannot perform a given practical task (Ref. 19):
You simply ask him what he thinks the probability is that
he can successfully perform the task. You may then ask
him to actually do it. If you do, then he is rewarded
(or penalized), whether he succeeds or fails, according
to the probability of success he predicted. It is obviously
not necessary to ask every student to perform every
task in order to induce them, in such a system, to give the
most realistic self-estimates of which they are capable.

Because these applications are so new, it will probably
be some time before it is completely clear whether
they are in fact as advantageous as they appear to be in
theory. However, a test or quiz is a communication device
between  students  and  teacher;  by  modifying  objective  tests 
in  the  way  indicated  in  this  section  the  channel  capacity 
of  the  communication  device  is  increased  and  somehow  this 
must  be  a  good  thing. 

4.3  Weather  Forecasting 

Meteorologists have, for some time, cast many of their
predictions in probabilistic terms, and the appropriateness
of this has been well recognized (Ref. 13). More recently,
reproducing scoring systems (or "proper scoring systems," as
the meteorologists call them) have been proposed and
actually applied (Ref. 24) to the evaluation of such forecasts.

Reproducing scoring systems may be used, of course,
not only to compare the merits of different human forecasts,
but also to compare the quality of different mathematical
algorithms for preparing probabilistic forecasts, and for
comparing such mechanical forecasts with those made by
flesh and blood human beings. In fact, the logarithmic
scoring rule, when used to distinguish which of two probabilistic
models gives the better fit to the observed
data, is identical with the well-known maximum likelihood
criterion.
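The link with maximum likelihood can be seen in a small example. The two lists of probabilities below are hypothetical data invented for illustration; each entry is the probability a model gave, in advance, to what actually happened on one occasion.

```python
import math

def total_log_score(probs_given_to_observed):
    """Sum of logarithmic payoffs log r over a sequence of observed outcomes."""
    return sum(math.log(r) for r in probs_given_to_observed)

def likelihood(probs_given_to_observed):
    """Probability the model assigned to the whole observed sequence."""
    return math.prod(probs_given_to_observed)

# Hypothetical forecasts from two models for five observed outcomes.
model_a = [0.7, 0.6, 0.8, 0.3, 0.9]
model_b = [0.5, 0.5, 0.5, 0.5, 0.5]

score_a, score_b = total_log_score(model_a), total_log_score(model_b)
print(score_a > score_b, likelihood(model_a) > likelihood(model_b))
```

Because the total logarithmic score is the logarithm of the likelihood, ranking models by score and ranking them by likelihood always give the same answer.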

4.4  Delphi  Processes 

The "Delphi process" is a name which has been given
to a technique in which a group of individuals independently
estimate some quantity, and arrive at an improved estimate
by carefully controlled communication with one another (Refs. 3, 4).
There is no reason why the quantity they are attempting
to estimate should not be the probability of some future
event.

To be specific, let us envision the following means
of forecasting the news events of 197X. A panel of twenty
or more knowledgeable individuals is selected, and each
is asked to answer, without consulting the others, the
same set of questions about probable events during the
year. Some typical questions might be the following:

"What is the probability that North Korean forces will
destroy or capture a United States intelligence vehicle
during the first six months of 197X?"

"What is the probability that at least one U.S. astronaut
will be killed in line of duty during 197X?"

"How many Republicans will be elected to the House
of Representatives in November?"

[...] troops in Vietnam on October 1 will be

     less than 100,000?               ____
     between 100,000 and 300,000?     ____
     between 300,000 and 400,000?     ____
     between 400,000 and 500,000?     ____
     between 500,000 and 600,000?     ____
     over 600,000?                    ____

After  the  participants  had  made  their  various  estimates, 
they  might  be  given,  say,  the  median  estimate  produced  by 
the  group  and  individually  asked  if  they  would  like  to 
change  their  estimate. 

It would be explained to the participants that they
would each be paid for their efforts at the end of 197X,
and that this payment would be in proportion to their individual
scores as calculated by an appropriate reproducing
scoring system. Enough money should be set aside to pay
them fees whose expected value would be adequate to justify
a good effort.

The results of such a survey might make an entertaining
thirty-minute TV news special; but would they be of
real help in solving any real-world problems? Would the
forecasts (or the "average" forecast) be more reliable
than forecasts derived by other methods? Would the forecasts
improve significantly if such a survey was taken
year after year, with greater weight being given to the
seasoned experts who showed high accuracy in preceding
surveys? My inclination is to answer all of these
questions in the affirmative, but barring an actual trial
there is no way of answering any of them with assurance.

5.  POTENTIAL DIFFICULTIES

5.1  Irrationality

The  concept  of  applying  reproducing  scoring  systems 
to  the  respondents  in  a  Delphi  panel  is  founded  on  the 
notion  that  individuals  generally  behave  "rationally"  in 
the  sense  that,  under  conditions  of  uncertainty,  they  will 
act  in  such  a  way  that  they  maximize  their  expected  gains. 

Unfortunately there is some evidence that this is not the
case (Refs. 20, 21). For example, when confronted with a choice
between gambles, some people will choose the one with the
highest probability of winning something rather than the
one  with  the  highest  expected  payoff;  others  will  choose 
the  gamble  with  the  highest  top  prize  regardless  of  how 
remote  the  chance  of  winning  it  may  be.  It  may  be  that 
the  sort  of  person  we  are  apt  to  select  for  a  Delphi  panel 
is  not  likely  to  behave  in  those  "irrational"  ways;  but 
if  we  do  have  the  bad  luck  to  find  a  large  number  of  such 
people  on  our  panel  then  it  is  obvious  that  any  reproducing 
scoring  system  will  have  highly  unpredictable  effects.  It 
may  provide  the  panel  with  incentives  to  exaggerate  or 
understate  their  true  subjective  probabilities,  and  the 
resulting  reports  may  thus  turn  out  to  be  less  meaningful 
than  if  we  had  used  no  scoring  system  whatsoever. 

To guard against such a possibility it would seem
wise, if a substantial number of panelists were not well-known
to you, to include certain questions on your Delphi
questionnaire whose sole purpose would be to determine if
the respondent was "rational" or not. Unfortunately such
questions would be, perforce, somewhat artificial, such as
"What is the probability that Army will kick off at the
beginning of the Army-Navy game next fall?"

5.2  Unintended  Payoffs 

It  is  possible  for  a  panel  to  be  entirely  "rational" 
in  the  sense  of  the  preceding  section,  and  still  give 
distorted  responses  simply  because  the  "expected  gains" 
which  they  are  maximizing  include  factors  which  you  have 
neglected  to  consider.  For  example,  the  panelists  might 
have  special  interest  in  having  you  adopt  a  particular 
point  of  view  or  course  of  action. 

Indeed,  the  world  of  forecasting  is  today  awash  with 
conflict  of  interest  at  all  levels.  Your  family  physician 
predicts  that  if  you  let  him  perform  a  $3,000  operation 
you  will  never  be  troubled  with  leg  cramps  again;  your 
lawyer  predicts  victory  in  litigation  while  collecting  his 
retainer;  an  Air  Force  general  predicts  victory  in  Vietnam 
if  we  step  up  the  bombing;  an  electronics  firm  predicts 
marvelous  bombing  accuracies  in  five  years  if  we  spend 
more  on  research  and  development;  many  research  reports 
(including  this  one)  implicitly  or  explicitly  predict  high 
payoffs  from  further  research.  The  self-serving  nature 
of many prophecies is often obvious, but it is more dangerous
when it is subtle and well hidden.

It would be attractive to suppose that using a reproducing
scoring system and making rewards proportional to
score would do away with the problem of conflict of interest.
Of course it would not. The president of an
electronics firm interested in building a new bomb-nav
system is going to predict high confidence in high performance
for the new system regardless of what kind of
explicit scoring system you apply to his prophecies, simply
because he has so much more to gain from getting the
contract than he has to gain from any payments you might
realistically make for accurate forecasting. Reproducing
scoring systems do make it possible to bring in whole
new groups of forecasters, individuals who ordinarily
would not bother to work out a forecast in a given area
because it does not directly affect their personal interests.
Such people could be influenced by a properly
designed system to work out forecasts in which the compulsion
of personal interest was all toward as much accuracy
as they are capable of. But it would be a mistake
to suppose that holding out one of the scoring systems
discussed in this paper to the same people who are now
producing self-serving forecasts would suddenly induce
them to produce unbiased forecasts in the future.

Another  possible  unintended  payoff  is  similar  to  the 
"Colonel  Blotto"  scoring  system  we  discussed  previously. 

Suppose you set up a Delphi experiment with five
questions and a thousand participants, and announced that
you would use, say, the quadratic scoring system and would
award $1,000 to whoever achieved the top score. This
does not constitute a reproducing scoring system. For
suppose each question was a two-alternative forecast with
true probability of .50. Any participant who perceived
the true probabilities and reported them would, with
certainty, achieve a score of zero. On the other hand,
a crafty speculator who expresses "certainty" on all five
questions has one chance in thirty-two of making an unbeatable
score of 2.5. His expected score is -2.5, but
making a big negative score leaves him no worse off than
the cautious clunk who scores zero; the only thing that
matters is his chance of coming in first. To calculate
the true optimal policy in this system is a very complex
matter, since it is essentially a 1,000-person game we
are discussing, but it is clearly not wise to make "honest"
forecasts.
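The speculator's position can be worked out by enumerating the thirty-two possible outcomes. The sketch assumes the two-alternative quadratic payoff normalized so that a report of .50 scores zero (which gives the top score of 2.5 quoted above); the names are illustrative.

```python
from itertools import product

def quadratic(r):
    """Two-alternative quadratic payoff for probability r placed on the
    alternative that occurs, normalized so that r = 1/2 scores zero."""
    return 2.0 * r - r * r - (1.0 - r) ** 2 - 0.5

# The speculator states certainty on five questions, each of which is
# really a coin flip: he is simply right or wrong on each one.
scores = [
    sum(quadratic(1.0 if right else 0.0) for right in outcome)
    for outcome in product([True, False], repeat=5)
]

top_score = max(scores)
chance_of_top = scores.count(top_score) / len(scores)
expected_score = sum(scores) / len(scores)
honest_score = 5 * quadratic(0.5)

print(top_score, chance_of_top, expected_score, honest_score)
```

Under this normalization the speculator gives up 2.5 points of expected score in exchange for a one in thirty-two chance of a total that no honest participant can reach.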

In actual cases, of course, the pressure toward extreme
forecasts caused by this top-dog syndrome is apt
to be considerably more subtle. For example, one of the
functions we have repeatedly stressed for scoring systems
is distinguishing good forecasters from poor ones. If
the members of your panel get the impression that the top
forecasters will get special recognition (more than just
a monetary award proportional to their score) this will
introduce a distorting tendency into their forecasts.

Finally  there  is  the  well-known  nonlinear  utility 
of  money.  Twenty  thousand  dollars  is  "worth"  less  than 
twice  as  much  as  ten  thousand  dollars.  The  conflict  of 
interest  problem  and  the  top-dog  syndrome  argue  in  favor 
of  making  the  monetary  award  for  accurate  forecasting 
substantial. But if you make the awards too substantial
you may find yourself in a situation where some of your
respondents become overcautious, and prefer to go for
small but relatively certain gains in preference to maximizing
their expected gain through a riskier strategy.

5.3  Slow  Discrimination 

Reproducing  scoring  systems  can  indeed  be  used  to 
determine  which  of  two  probabilistic  forecasters  is  the 
more  accurate,  but  it  may  take  many  trials  before  you  can 
place  much  confidence  in  this  determination. 

For example, suppose two experts are asked to predict
the probability of occurrence of n different events, and
each event has true probability one-half. Suppose expert
number one ascribes probability one-half to each event, and
expert number two ascribes probability three-fifths to each event.
Suppose we are using a quadratic reproducing scoring system.
Let d(n) denote the difference between the first expert's
score and the second's (actually the first expert will,
if the scoring system is properly normalized, make zero
score, so -d(n) will be the second expert's score). The
quantity d(n) is a random variable; it may be either
positive or negative, but its mean is positive and the
ratio of its mean to its standard deviation may be calculated
to be √n/10. This means that even after 100 predictions
have been examined, there is still about one
chance in six that the second (less accurate) forecaster
will outscore the first. After 400 predictions the chance
of this happening is down to one in forty. A great many
forecasts, in other words, have to be evaluated before
you can place much confidence in the leading scorer being
truly the best forecaster.
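These figures can be reproduced with a short calculation. The sketch assumes the first expert reports one-half and the second three-fifths, uses the two-alternative quadratic payoff normalized to zero at one-half, and treats d(n) as approximately normal for large n; the normal approximation and the function names are assumptions of the example.

```python
import math

def quadratic(r):
    """Two-alternative quadratic payoff, normalized so a report of 1/2 scores zero."""
    return 2.0 * r - r * r - (1.0 - r) ** 2 - 0.5

p_true = 0.5           # true probability of every event
r1, r2 = 0.5, 0.6      # reports of expert one and expert two

# Per-event score difference (expert one minus expert two) in the two outcomes.
d_hit = quadratic(r1) - quadratic(r2)
d_miss = quadratic(1.0 - r1) - quadratic(1.0 - r2)

mean = p_true * d_hit + (1.0 - p_true) * d_miss
var = p_true * (d_hit - mean) ** 2 + (1.0 - p_true) * (d_miss - mean) ** 2
ratio_per_event = mean / math.sqrt(var)   # mean/sd of d(n) is this times sqrt(n)

def chance_worse_expert_leads(n):
    """Normal approximation to P(d(n) < 0) after n independent events."""
    z = ratio_per_event * math.sqrt(n)
    return 0.5 * math.erfc(z / math.sqrt(2.0))

print(ratio_per_event, chance_worse_expert_leads(100), chance_worse_expert_leads(400))
```

The per-event ratio comes out at one-tenth, so 100 events put the mean of d(n) only one standard deviation above zero and 400 events put it two standard deviations above.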

In fact, it can be shown that under any reproducing
scoring system whatsoever, if d(n) represents the difference
after n trials between the score of an expert who
ascribed probability y_2 and one who ascribed probability y_1
to each event (and the true probability is p), then


From  this  we  see  that  experts  will  be  discriminated 
more  quickly  if  they  are  asked  to  forecast  high  probability 
events  than  middling  probability  events,  but  that  in  any 
case  a  fairly  large  number  of  forecasts  are  apt  to  be 
required. 

See  Appendix  D. 


Appendix  A 
MONOTONICITY 

Let f be a reproducing scoring system on two variables.
Define

     H(x, p) = p f_1(x) + (1-p) f_0(x)

H(x, p) is clearly the expected payoff of an expert
assessing the probability of an event as x when its true
probability is p.

THEOREM A: If p ≤ x ≤ y or y ≤ x ≤ p, then
H(x, p) ≥ H(y, p).

PROOF (Due to John Lindsey): By definition,

(A.1)  x f_1(x) + (1-x) f_0(x) \ge x f_1(y) + (1-x) f_0(y)

(A.2)  y f_1(y) + (1-y) f_0(y) \ge y f_1(x) + (1-y) f_0(x)

Now assume p ≤ x < y. Then

     \alpha = \frac{y-p}{y-x},  \qquad  \beta = \frac{x-p}{y-x}

are both nonnegative, with αx - βy = p and α - β = 1.
Multiplying A.1 by α, A.2 by β, and adding, gives
us

(A.3)  p f_1(x) + (1-p) f_0(x) \ge p f_1(y) + (1-p) f_0(y),

which was to be proved. The same argument works for
y < x ≤ p.

COROLLARY: f_1(x) is a monotone increasing function
of x. f_0(x) is a monotone decreasing function of x.

PROOF: f_1(x) = H(x, 1). f_0(x) = H(x, 0).
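Theorem A can be spot-checked on a particular reproducing system. The sketch below uses the two-alternative logarithmic payoffs f_1(x) = log 2x and f_0(x) = log 2(1-x) as an assumed example and tests the inequality on a grid; it is an illustration, not a substitute for the proof.

```python
import math

def f1(x):
    """Payoff for reporting x when the event occurs (logarithmic example)."""
    return math.log(2.0 * x)

def f0(x):
    """Payoff for reporting x when the event does not occur."""
    return math.log(2.0 * (1.0 - x))

def H(x, p):
    """Expected payoff of reporting x when the true probability is p."""
    return p * f1(x) + (1.0 - p) * f0(x)

def theorem_a_holds(p, grid, tol=1e-12):
    """Check H(x, p) >= H(y, p) whenever x lies between p and y."""
    for x in grid:
        for y in grid:
            between = (p <= x <= y) or (y <= x <= p)
            if between and H(x, p) < H(y, p) - tol:
                return False
    return True

grid = [k / 20 for k in range(1, 20)]
print(all(theorem_a_holds(p, grid) for p in grid))
```

In words, moving a report further from the truth in either direction never raises the expected payoff.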


Appendix  B 
NONUNIQUENESS OF φ


In Sec. 3 we showed that any symmetric differentiable
reproducing scoring system on two alternatives f
could be put in the form

B.1  f(r) = \int_{1/2}^{r} \frac{\varphi(t)}{t} \, dt - \Big[ \int_{1/2}^{r} \varphi(t) \, dt + \int_{1/2}^{1-r} \varphi(t) \, dt \Big]

where φ(1-t) = φ(t) (i.e., φ is symmetric about 1/2). On
the other hand, we also showed that any function of the
form B.1 is a symmetric reproducing scoring system if φ
is positive, whether φ is symmetric about 1/2 or not. The
explanation for this seeming contradiction is that quite
different φ may give rise to the same f.

Direct computation shows that, if g(t) is any integrable
function, then

B.2  \psi(t) = t [g(1-t) - g(t)]

is a solution to the integral equation

B.3  0 = \int_{1/2}^{r} \frac{\psi(t)}{t} \, dt - \int_{1/2}^{r} \psi(t) \, dt - \int_{1/2}^{1-r} \psi(t) \, dt

This is the most general solution, since if ψ(t) is
any solution, differentiation of B.3 shows that

B.4  \psi(t) = t [\psi(t) - \psi(1-t)]

which is of the form B.2 with g = -ψ.

Now let φ(t) be any positive, integrable function
defined over [0,1]. Then

B.5  \varphi_1(t) = \varphi(t) + t [\varphi(1-t) - \varphi(t)]

is positive and symmetric and when plugged into B.1 gives
exactly the same f as does φ.


Appendix C

AN INVARIANCE PROPERTY CHARACTERIZING
THE QUADRATIC SCORING SYSTEM

It would seem desirable to find a symmetric reproducing
scoring system on two alternatives with the property that
an individual's expected reward would depend only on his
accuracy, and not upon the value of the true probability
of the event in question. A little reflection shows that
there can be no reproducing scoring system with this property:
We may assume the system normalized so that the
payoff for perfect accuracy when p_1 = p_2 = 1/2 is zero.
Then the payoff for perfect accuracy when p_1 = 1, p_2 = 0
must also be zero, so f(1/2) = f(1) = 0, which contradicts
the theorem of Appendix A (that f is monotone increasing).

Even though it is impossible to find a nontrivial
reproducing scoring system which makes an individual's
expected reward independent of p, it is possible to find a
scheme which makes an individual's relative expected reward
(i.e., the difference between his reward and that expected
by a hypothetical perfect expert) depend only on p - r and
not on p. That is to say, the difference in expected rewards
is only a function of the difference between the
individual's forecast and the true probability. To be more
specific, let us define E(r,p) to be the expected return of
an individual who ascribes probability r to an event whose
true probability is p. That is,

C.1  E(r,p) = p f(r) + (1-p) f(1-r)


Now consider the quadratic reproducing scoring system

C.2  f(r) = 4r - 2r^2 - 1 = 1 - 2(1-r)^2

Simple calculation shows that

C.3  E(p,p) - E(r,p) = 2(p-r)^2


Thus  the  quadratic  scoring  system  has  the  property 
we  desire;  we  will  now  demonstrate  that  it  is  essentially 
the  only  scoring  system  which  does. 
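The "simple calculation" can be confirmed numerically. The sketch assumes the quadratic payoff in the form f(r) = 1 - 2(1-r)^2 and the expected return E(r,p) = p f(r) + (1-p) f(1-r); the function names are illustrative.

```python
def f(r):
    """Quadratic payoff, assumed in the form 1 - 2(1 - r)^2."""
    return 1.0 - 2.0 * (1.0 - r) ** 2

def E(r, p):
    """Expected return for reporting r when the true probability is p."""
    return p * f(r) + (1.0 - p) * f(1.0 - r)

def relative_loss(r, p):
    """Shortfall relative to a perfect expert who reports p itself."""
    return E(p, p) - E(r, p)

for p, r in [(0.2, 0.5), (0.7, 1.0), (0.9, 0.6)]:
    print(p, r, relative_loss(r, p), 2.0 * (p - r) ** 2)
```

The first two pairs have the same gap p - r = -0.3 and therefore the same shortfall, even though the true probabilities differ.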

THEOREM: Let f(r) be a differentiable symmetric
reproducing scoring system on two alternatives
which satisfies the following conditions:

     (1)  E(p,p) - E(r,p) is a function of (p-r).

     (2)  f(1/2) = 0.

     (3)  f(1) = 1.

Then

     f(r) = 1 - 4(1-r)^2
PROOF:  Let 


C.4 


E(pjp)  -  E(r,p)  -  h(p-r) . 
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By  Eq.  (2)  we  see  that  E(^,p)  *  0,  so 


C.  5 


E(p> p)  -  h(p-fe) 


Combining C.4 and C.5 we have

C.6  $h(p-\tfrac{1}{2}) - h(p-r) = E(r,p) = p\,f(r) + (1-p)\,f(1-r)$

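Let $r = \tfrac{1}{2} + \epsilon$, and divide both sides of C.6 by $\epsilon$,
and we find

C.7  $\frac{h(p-\tfrac{1}{2}) - h(p-\tfrac{1}{2}-\epsilon)}{\epsilon} = \frac{f(\tfrac{1}{2}-\epsilon) - f(\tfrac{1}{2})}{\epsilon} + p\,\frac{f(\tfrac{1}{2}+\epsilon) - f(\tfrac{1}{2}-\epsilon)}{\epsilon}$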


Let $\epsilon \to 0$. The limit on the right-hand side exists
since f is differentiable; thus the limit on the left-
hand side exists and we have

C.8  $h'(p-\tfrac{1}{2}) = f'(\tfrac{1}{2})\,(2p-1)$.

We know h(0) = 0, and thus solving the differential
equation C.8 gives us

C.9  $h(p-\tfrac{1}{2}) = f'(\tfrac{1}{2})\,(p-\tfrac{1}{2})^2$


Since

C.10  $h(\tfrac{1}{2}) = E(1,1) - E(\tfrac{1}{2},1) = E(1,1) = f(1) = 1$

it follows that $f'(\tfrac{1}{2}) = 4$. Thus

C.11  $h(y) = 4y^2$

Since

C.12  $h(1-x) = E(1,1) - E(x,1) = 1 - f(x)$

we see at once that

C.13  $f(x) = 1 - 4(1-x)^2$

which concludes the proof. Note that we did not use f's
differentiability (or even continuity) except at the point
$x = \tfrac{1}{2}$.
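
To illustrate why condition (1) is restrictive, the sketch below compares the f of C.13 with a logarithmic system scaled to satisfy conditions (2) and (3), f(r) = 1 + log2(r). The logarithmic system is brought in purely for contrast and is an assumption of this example, not part of the proof; the function names are likewise illustrative. Forecasts are placed a fixed distance 0.1 below several true probabilities.

```python
from math import log2


def f_theorem(r):
    """C.13: f(r) = 1 - 4(1 - r)^2."""
    return 1.0 - 4.0 * (1.0 - r) ** 2


def f_log(r):
    """Logarithmic system scaled so f(1/2) = 0 and f(1) = 1 (assumed for contrast)."""
    return 1.0 + log2(r)


def relative_loss(f, r, p):
    """E(p,p) - E(r,p), with E(r,p) = p f(r) + (1-p) f(1-r) as in C.1."""
    def expected(q):
        return p * f(q) + (1.0 - p) * f(1.0 - q)
    return expected(p) - expected(r)


def losses_at_fixed_gap(f, gap, true_probs):
    """Relative loss for forecasts r = p - gap at each true probability p."""
    return [relative_loss(f, p - gap, p) for p in true_probs]


if __name__ == "__main__":
    for name, f in [("quadratic", f_theorem), ("logarithmic", f_log)]:
        print(name, losses_at_fixed_gap(f, 0.1, [0.5, 0.7, 0.9]))
```

Under these assumptions the quadratic losses coincide across the three true probabilities, while the logarithmic losses vary with p even though p - r is held fixed.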
Appendix D

BOUNDS ON DISCRIMINATION

One  purpose  of  a  reproducing  scoring  system  is  to 
attempt  to  single  out  which  experts  (of  a  group)  are  the 
most  accurate  in  their  estimates  of  the  probability  of 
given  events  taking  place.  The  rating  system  itself  may, 
of course, give ratings which are faulty due to "bad luck."
For example: if expert #1 predicts that a given fair coin
will come up heads on its next toss with probability .50,
while expert #2, not understanding that it is a fair coin,
fairly  flipped,  makes  a  prediction  that  it  will  certainly 
come  up  heads,  and  the  coin  does  come  up  heads,  then  any 
of  the  rating  schemes  we  have  been  discussing  will  iden¬ 
tify  expert  #2  as  the  more  accurate  of  the  two.  Over  a 
long  series  of  flips,  of  course,  expert  #1  will  expect  to 
surpass  the  total  rating  of  expert  #2  (providing  the  latter 
continues  to  predict  heads  with  certainty).  In  this  sec¬ 
tion  we  will  discuss  the  probability  that  a  reproducing 
scoring  system  will  fail  (that  is,  that  it  will  rate  an 
inaccurate  expert  more  highly  than  an  accurate  one)  over 
a  given  set  of  predictions. 
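
The coin example can be made concrete. The sketch below assumes the quadratic system f(r) = 1 - 4(1 - r)^2 of Appendix C (the text leaves the choice of system open) and computes the exact probability that expert #2, who always predicts heads with certainty, finishes n fair flips with a strictly higher total than expert #1, who always predicts .50. The function names are illustrative.

```python
from math import comb


def f(r):
    """Quadratic system of Appendix C (assumed here): f(r) = 1 - 4(1 - r)^2."""
    return 1.0 - 4.0 * (1.0 - r) ** 2


def prob_expert2_ahead(n):
    """Exact probability that expert #2 strictly outscores expert #1 after n fair flips."""
    favorable = 0
    for heads in range(n + 1):
        tails = n - heads
        score1 = n * f(0.5)                          # expert #1 says .50 every time
        score2 = heads * f(1.0) + tails * f(0.0)     # expert #2 says heads for certain
        if score2 > score1:
            favorable += comb(n, heads)              # number of sequences with this many heads
    return favorable / 2 ** n


if __name__ == "__main__":
    for n in (1, 4, 10, 40):
        print(n, prob_expert2_ahead(n))
```

Under this system a single tail costs expert #2 three times what a head earns, so the misrating probability starts at one half for a single flip and shrinks as n grows.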

In general this probability will be a complex function
of the spectrum of true probabilities and estimates
encompassed in the set of predictions under consideration.
For simplicity, we shall limit our consideration to two-
alternative symmetric scoring systems, and assume that
expert #1 and expert #2 estimate the probability of
an event taking place as $y_1$ and $y_2$, respectively, and that
the true probability of the event is p. If this situation
recurs n times, what is the probability that expert #1 will
have a higher total score than expert #2? Let d(n) denote
the difference between the total score of expert #1 and the
total score of expert #2 after n "trials." Of course d(n)
is a random variable. If it is positive, then #1 outscores
#2; if it is negative, then #2 outscores #1. It is easy
to calculate that

(D.1)  $\mathrm{mean}(d(n)) = n\,[\,p\{f(y_1) - f(y_2)\} + (1-p)\{f(1-y_1) - f(1-y_2)\}\,]$

       $= n\{f(1-y_1) - f(1-y_2)\} + np\{f(y_1) - f(y_2) - f(1-y_1) + f(1-y_2)\}$

(D.2)  $\mathrm{variance}(d(n)) = np(1-p)\{f(y_1) - f(y_2) - f(1-y_1) + f(1-y_2)\}^2$

Let us assume that $y_1 > y_2$, so that

(D.3)  $\{f(y_1) - f(y_2) - f(1-y_1) + f(1-y_2)\} > 0$.
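
The mean and variance of d(n) can be checked against a direct enumeration over the number of times the event occurs. The sketch below does this for one set of inputs, assuming the quadratic system of Appendix C; the function names and the sample values of y1, y2, p and n are illustrative only.

```python
from math import comb


def f_quadratic(r):
    """Quadratic system of Appendix C (assumed here): f(r) = 1 - 4(1 - r)^2."""
    return 1.0 - 4.0 * (1.0 - r) ** 2


def mean_var_formula(f, y1, y2, p, n):
    """Mean and variance of d(n) computed from (D.1) and (D.2)."""
    a = f(y1) - f(y2)                # margin of #1 over #2 when the event occurs
    b = f(1.0 - y1) - f(1.0 - y2)    # margin of #1 over #2 when it does not
    mean = n * (p * a + (1.0 - p) * b)
    variance = n * p * (1.0 - p) * (a - b) ** 2
    return mean, variance


def mean_var_enumerated(f, y1, y2, p, n):
    """Same quantities by summing over the number k of occurrences in n trials."""
    a = f(y1) - f(y2)
    b = f(1.0 - y1) - f(1.0 - y2)
    mean = 0.0
    second_moment = 0.0
    for k in range(n + 1):
        weight = comb(n, k) * p ** k * (1.0 - p) ** (n - k)   # binomial probability of k
        d = k * a + (n - k) * b                               # value of d(n) given k
        mean += weight * d
        second_moment += weight * d * d
    return mean, second_moment - mean ** 2


if __name__ == "__main__":
    print(mean_var_formula(f_quadratic, 0.7, 0.4, 0.6, 12))
    print(mean_var_enumerated(f_quadratic, 0.7, 0.4, 0.6, 12))
```

The two printed pairs should agree up to floating-point rounding.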


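Then, we have

(D.4)  $\frac{\mathrm{mean}(d(n))}{\mathrm{stand.\ dev.}(d(n))} = \sqrt{\frac{n}{p(1-p)}}\left[\,p - \frac{1}{1 - \frac{f(y_1) - f(y_2)}{f(1-y_1) - f(1-y_2)}}\right]$

Let $\varphi$ be the symmetric function which (in equation B.1)
defines f. Then

(D.5)  $\frac{f(y_1) - f(y_2)}{f(1-y_1) - f(1-y_2)} = \frac{\int_{y_2}^{y_1} \frac{\varphi(t)}{t}\,dt}{\int_{1-y_2}^{1-y_1} \frac{\varphi(t)}{t}\,dt} = -\,\frac{\int_{y_2}^{y_1} \frac{\varphi(t)}{t}\,dt}{\int_{y_2}^{y_1} \frac{\varphi(t)}{1-t}\,dt}$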

By the mean value theorem

(D.6)  $\frac{1-y_1}{y_1}\int_{y_2}^{y_1} \frac{\varphi(t)}{1-t}\,dt < \int_{y_2}^{y_1} \frac{\varphi(t)}{t}\,dt < \frac{1-y_2}{y_2}\int_{y_2}^{y_1} \frac{\varphi(t)}{1-t}\,dt$

Thus

(D.7)  $\frac{1-y_1}{y_1} < -\,\frac{f(y_1) - f(y_2)}{f(1-y_1) - f(1-y_2)} < \frac{1-y_2}{y_2}$

Plugging inequality (D.7) into Eq. (D.4) gives
us

(D.8)  $\sqrt{\frac{n}{p(1-p)}}\,[p - y_1] < \frac{\mathrm{mean}(d(n))}{\mathrm{stand.\ dev.}(d(n))} < \sqrt{\frac{n}{p(1-p)}}\,[p - y_2]$

This inequality is useful because it does not contain
the particular reward function f. The absolute value of
the ratio of the mean to the standard deviation measures
how likely the rating scheme is to make a misrating. If n
is large enough that d(n) is approximately normal, then the
probability of misrating will be less than .025 if the
absolute value of the mean over the standard deviation is
greater than 2. One might imagine that it would be possible
to choose f so cleverly that this ratio would be large even
if n were not very great. Inequality (D.8) shows that no
matter what reproducing scoring system you choose, the ratio
will fall between certain limits. Looking at Eq. (D.6)
shows that an f which comes close to either limit is
dependent on the particular $y_1$ and $y_2$ involved, so it appears
that there will be no one scheme which is the best
discriminator.
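
The bounds of (D.8) can be checked numerically for several scoring systems at once. The sketch below uses the quadratic system of Appendix C together with a logarithmic and a spherical system; the latter two are standard proper scoring rules brought in as assumptions of this example, and the function names and sample inputs are illustrative only.

```python
from math import log, sqrt


def f_quadratic(r):
    """Quadratic system of Appendix C: f(r) = 1 - 4(1 - r)^2."""
    return 1.0 - 4.0 * (1.0 - r) ** 2


def f_log(r):
    """Logarithmic system (assumed for illustration)."""
    return log(r)


def f_spherical(r):
    """Spherical system for two alternatives (assumed for illustration)."""
    return r / sqrt(r * r + (1.0 - r) ** 2)


def mean_over_sd(f, y1, y2, p, n):
    """Ratio of (D.1) to the square root of (D.2); requires y1 > y2."""
    a = f(y1) - f(y2)
    b = f(1.0 - y1) - f(1.0 - y2)
    mean = n * (p * a + (1.0 - p) * b)
    sd = sqrt(n * p * (1.0 - p)) * (a - b)
    return mean / sd


def d8_bounds(y1, y2, p, n):
    """Lower and upper limits of (D.8); they do not involve f."""
    scale = sqrt(n / (p * (1.0 - p)))
    return scale * (p - y1), scale * (p - y2)


if __name__ == "__main__":
    y1, y2, p, n = 0.7, 0.4, 0.6, 25
    lower, upper = d8_bounds(y1, y2, p, n)
    print(f"bounds: {lower:.4f} .. {upper:.4f}")
    for name, f in [("quadratic", f_quadratic), ("log", f_log), ("spherical", f_spherical)]:
        print(f"{name:10s} {mean_over_sd(f, y1, y2, p, n):.4f}")
```

With these inputs each ratio should fall strictly between the two limits. The limits themselves depend only on n, p, $y_1$ and $y_2$, so changing f can move the ratio within the interval but not outside it.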
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